DAY[25] - Kaggle in Practice: Feature Processing (2)

11th iThome Ironman Contest

DAY 25

AI & Data

Python Machine Learning Introduction and Practice series, part 25

Tags: 11th Ironman Contest, python3, machine learning

Austin

Team Bikini Bottom

2019-10-10 19:45:58

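With the features mostly cleaned up, one step remains. Because we merged the Train and Test datasets at the start, we now need to split the combined data back into its original parts and do some simple outlier handling.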

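# y is the training-set target, so the first len(y) rows are the training data
# and the remaining rows are the test (submission) data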
X = final_features.iloc[:len(y), :]
X_sub = final_features.iloc[len(y):, :]
X.shape, y.shape, X_sub.shape
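
As a minimal sketch of why this slicing works, assume the training rows were concatenated first and the test rows second. The frames `train_toy` and `test_toy` and every other name below are hypothetical stand-ins, not the competition data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train/test sets
train_toy = pd.DataFrame({"area": [50, 60, 70], "price": [100, 120, 140]})
test_toy = pd.DataFrame({"area": [55, 65]})

# Separate the target, then stack the train features on top of the test features
y_toy = train_toy["price"]
combined = pd.concat(
    [train_toy.drop(columns="price"), test_toy]
).reset_index(drop=True)

# Train rows come first, so the first len(y_toy) rows are the training set
X_toy = combined.iloc[:len(y_toy), :]
X_sub_toy = combined.iloc[len(y_toy):, :]

print(X_toy.shape, y_toy.shape, X_sub_toy.shape)
```

The split only recovers the right rows if nothing reordered or removed rows of the combined frame in between, which is why row-level cleanup such as outlier removal happens after this step.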

While exploring the data, we can identify the row positions of the outliers and drop them from both X and y.

outliers = [30, 88, 462, 631, 1322]
X = X.drop(X.index[outliers])
y = y.drop(y.index[outliers])
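
The list above holds row positions, while `drop()` expects index labels, which is why the code goes through `X.index[outliers]`. The following is a small sketch on hypothetical toy data (`X_demo`, `y_demo`, and `outlier_positions` are made up for illustration) where labels and positions deliberately differ.

```python
import pandas as pd

# Hypothetical toy data; the index labels deliberately differ from positions
X_demo = pd.DataFrame(
    {"area": [10, 20, 30, 40, 50]}, index=[100, 101, 102, 103, 104]
)
y_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=X_demo.index)

outlier_positions = [1, 3]  # positions, not labels

# X_demo.index[...] turns positions into labels, which is what drop() expects
X_clean = X_demo.drop(X_demo.index[outlier_positions])
y_clean = y_demo.drop(y_demo.index[outlier_positions])

print(X_clean.index.tolist())
print(y_clean.tolist())
```

Dropping the same positions from both objects keeps the features and the target aligned row by row.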

overfit = []

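# Drop near-constant features: columns where a single value (usually 0)
# accounts for more than 99.94% of the rows
for i in X.columns:
    counts = X[i].value_counts()
    # value_counts() sorts by frequency, so the first entry is the most common value
    zeros = counts.iloc[0]
    if zeros / len(X) * 100 > 99.94:
        overfit.append(i)

X = X.drop(overfit, axis=1)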
X_sub = X_sub.drop(overfit, axis=1)
overfit
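
To make the filter easier to follow, here is a rough sketch on hypothetical toy data. Note that the check looks at the most frequent value in each column, whatever that value is, so a column that is almost always 1 gets flagged just like a column that is almost always 0. The names `X_nc`, `near_constant`, and `threshold` are illustrative, and the threshold is lowered from 99.94 only so that ten rows are enough to trigger it.

```python
import pandas as pd

# Hypothetical toy frame: 'rare_flag' is 0 in 9 of 10 rows, 'const_one' is always 1
X_nc = pd.DataFrame({
    "area": [50, 60, 70, 80, 90, 55, 65, 75, 85, 95],
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "const_one": [1] * 10,
})

threshold = 85.0  # percent; lowered from 99.94 so the toy data triggers it

near_constant = []
for col in X_nc.columns:
    # value_counts() is sorted by frequency, so iloc[0] is the top value's count
    top_share = X_nc[col].value_counts().iloc[0] / len(X_nc) * 100
    if top_share > threshold:
        near_constant.append(col)

# Remove the flagged columns; the same list should be dropped from the test set
X_nc_reduced = X_nc.drop(near_constant, axis=1)
print(near_constant)
print(X_nc_reduced.columns.tolist())
```

Such columns carry almost no information, and a model can latch onto the handful of rows that differ, which is presumably why the original list is named `overfit`.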

Finally, take a quick look at the cleaned-up result.